
[Train] Don't use NCCL_BLOCKING_WAIT #29562

Merged · 7 commits merged into ray-project:master · Nov 17, 2022

Conversation

@amogkam (Contributor) commented Oct 21, 2022

Signed-off-by: Amog Kamsetty [email protected]

Per the PyTorch docs, we should use NCCL_ASYNC_ERROR_HANDLING instead of NCCL_BLOCKING_WAIT.

Closes #29419

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Amog Kamsetty <[email protected]>
python/ray/train/torch/config.py (diff under review)
)
- os.environ["NCCL_BLOCKING_WAIT"] = "1"
+ os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
Member:

Can we update the test plan for failure behavior? If I understand correctly, the documentation says NCCL_ASYNC_ERROR_HANDLING is more performant but crashes the process, whereas NCCL_BLOCKING_WAIT surfaces errors that the user can catch and handle. This has implications for the Ray trainer's error-handling semantics.
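To make the trade-off concrete, here is a minimal sketch (my own illustration based on the PyTorch documentation, not code from this PR; the 30-second timeout, the device, and the single-collective setup are arbitrary assumptions) of how the two modes differ for user code:

```python
import datetime
import os

import torch
import torch.distributed as dist

# The env var must be set before the NCCL process group is created.
# Option A: blocking wait -- a timed-out collective raises an error that the
# training code can catch and handle.
# os.environ["NCCL_BLOCKING_WAIT"] = "1"
# Option B (what this PR switches to): async error handling -- the NCCL
# watchdog detects the timeout and takes the worker process down, so there is
# no exception to catch in user code.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# Assumes the usual env:// rendezvous variables (MASTER_ADDR, RANK, ...) are set.
dist.init_process_group("nccl", timeout=datetime.timedelta(seconds=30))

try:
    dist.all_reduce(torch.ones(1, device="cuda"))
except RuntimeError as exc:
    # Reachable with NCCL_BLOCKING_WAIT; with NCCL_ASYNC_ERROR_HANDLING the
    # process is typically aborted by the watchdog before this handler runs.
    print(f"Collective failed: {exc}")
```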

Member:

+1, we should trigger this code path and make sure the crash output provides enough information to the user before merging. I don't think we can do much better than crashing, unfortunately.

@amogkam (Contributor, Author) commented Oct 21, 2022:

Agreed, we should do it. Any suggestions on how to trigger this code path? I couldn't think of an easy way.

Member:

Launch data-parallel training (minimum two actors) that uses NCCL to do the allreduce. Make one of the actors enter a while True: sleep loop so that it never reaches the allreduce. Then, after 30 minutes, you'll see how PyTorch crashes the process. It will be even easier if you reduce the timeout ;)
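For reference, a rough sketch of that reproduction (my own sketch, not the test that was eventually added in this PR; the TorchConfig timeout_s value and the two-GPU ScalingConfig are assumptions chosen so the failure shows up quickly):

```python
import time

import torch
import torch.distributed as dist

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_loop_per_worker():
    if session.get_world_rank() == 0:
        # Rank 0 never joins the collective, so the other rank's allreduce
        # eventually hits the NCCL timeout.
        while True:
            time.sleep(1)
    dist.all_reduce(torch.ones(1, device="cuda"))


trainer = TorchTrainer(
    train_loop_per_worker,
    torch_config=TorchConfig(backend="nccl", timeout_s=30),
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
# With NCCL_ASYNC_ERROR_HANDLING, the NCCL watchdog should take the timed-out
# worker process down rather than returning a catchable error.
trainer.fit()
```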

@amogkam (Contributor, Author) commented:

Yep, looks like an exception is being raised:

(RayTrainWorker pid=13803) [E ProcessGroupNCCL.cpp:737] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, Timeout(ms)=5000) ran for 7751 milliseconds before timing out.
(RayTrainWorker pid=13803) [E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
(RayTrainWorker pid=13803) [2022-10-21 16:23:36,638 E 13803 13875] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, Timeout(ms)=5000) ran for 7751 milliseconds before timing out.
(RayTrainWorker pid=13803) [2022-10-21 16:23:36,648 E 13803 13875] logging.cc:104: Stack trace: 
(RayTrainWorker pid=13803)  /home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xc74dda) [0x7f0934867dda] ray::operator<<()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/ray/_raylet.so(+0xc77598) [0x7f093486a598] ray::TerminateHandler()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacf6f) [0x7f0933b2af6f] __cxxabiv1::__terminate()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacfb1) [0x7f0933b2afb1] __cxxabiv1::__unexpected()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xacf6c) [0x7f0933b2af6c] __cxxabiv1::__terminate()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN4c10d16ProcessGroupNCCL8WorkNCCL15handleNCCLGuardEv+0x19f) [0x7efbc5ae2d4f] c10d::ProcessGroupNCCL::WorkNCCL::handleNCCLGuard()
(RayTrainWorker pid=13803) /home/ray/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so(_ZN4c10d16ProcessGroupNCCL15workCleanupLoopEv+0x199) [0x7efbc5ae71c9] c10d::ProcessGroupNCCL::workCleanupLoop()
(RayTrainWorker pid=13803) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xc9039) [0x7f0933b47039] execute_native_thread_routine
(RayTrainWorker pid=13803) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f09354e5609] start_thread
(RayTrainWorker pid=13803) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f093540a133] __clone
(RayTrainWorker pid=13803) 

But the Ray Actor is still alive, causing training to hang. @rkooo567 do you know why the actor is not terminating when receiving this exception?

Member:

Is the Ray actor still alive? I think the process that contained the Ray actor should be killed by SIGABRT: https://github.com/ray-project/ray/blob/master/src/ray/util/logging.cc#L106

@amogkam (Contributor, Author) commented Oct 22, 2022:

Yes, the actor is still alive. Not sure why the std::abort() is not being captured.

Note that the std::abort() is not being run in the main thread, but from what I understand, it should kill the entire process.

@amogkam (Contributor, Author) commented Nov 16, 2022:

Added a test.


Signed-off-by: Amog Kamsetty <[email protected]>
@amogkam (Contributor, Author) commented Oct 22, 2022:

Blocked on #29576

# NCCL should timeout.
if session.get_world_rank() == 0:
    while True:
        pass
Member:

Nit: can we have a time.sleep here? That way we don't consume an entire CPU unnecessarily 🔥
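i.e. something along these lines (a sketch of the suggested tweak, not necessarily the exact code that landed):

```python
import time

# NCCL should time out on the other rank, since rank 0 never joins the collective.
if session.get_world_rank() == 0:
    while True:
        time.sleep(1)  # sleep instead of busy-waiting so we don't peg a CPU core
```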

Signed-off-by: amogkam <[email protected]>
@jiaodong (Member) commented:

Hmm, wait: what's the status of #29576? Don't we need it first before changing our recovery behavior?

@amogkam merged commit 2631806 into ray-project:master on Nov 17, 2022
@amogkam deleted the nccl-block branch on November 17, 2022, 18:12
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
From the PyTorch docs, we should use NCCL_ASYNC_ERROR_HANDLING instead.

Signed-off-by: amogkam <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
Successfully merging this pull request may close these issues:

  • [Core] NCCL_BLOCKING_WAIT=1 leads to performance regression